
    Examining Information on Social Media: Topic Modelling, Trend Prediction and Community Classification

    In the past decade, the use of social media networks (e.g. Twitter) has increased dramatically, making them the main channels for the public to express their opinions, ideas and preferences, especially during an election or a referendum. Both researchers and the public are interested in understanding what topics are discussed during a real social event, what the trends of the discussed topics are, and what the future topical trend is. Indeed, modelling such topics and trends offers opportunities for social scientists to continue long-standing research, i.e. examining the exchange of information between people in different communities. We argue that computing science approaches can adequately assist social scientists to extract topics from social media data, to predict their topical trends, or to classify a social media user (e.g. a Twitter user) into a community. However, while topic modelling approaches and classification techniques have been widely used, challenges still exist: 1) existing topic modelling approaches can generate topics that lack coherence for social media data; 2) it is not easy to evaluate the coherence of topics; and 3) it can be challenging to generate a large training dataset for developing a social media user classifier. Hence, we identify four tasks to solve these problems and assist social scientists. Initially, we aim to propose topic coherence metrics that effectively evaluate the coherence of topics generated by topic modelling approaches. Such metrics are required to align with human judgements. Since topic modelling approaches cannot always generate useful topics, it is necessary to present users with the most coherent topics using the coherence metrics. Moreover, an effective coherence metric helps us evaluate the performance of our proposed topic modelling approaches. The second task is to propose a topic modelling approach that generates more coherent topics for social media data.
We argue that using the time dimension of social media posts helps a topic modelling approach to distinguish differences in word usage over time, and thus allows it to generate topics with higher coherence, together with their trends. A more coherent topic with its trend allows social scientists to quickly identify the topic's subject and to focus on analysing the connections between the extracted topics and social events, e.g. an election. Third, we aim to model and predict the topical trend. Given the timestamps of social media posts within topics, a topical trend can be modelled as a continuous distribution over time. Therefore, we argue that the future trends of topics can be predicted by estimating the density function of their continuous time distribution. By examining the future topical trend, social scientists can ensure the timeliness of the events they study. Politicians and policymakers can keep abreast of the topics that remain salient over time. Finally, we aim to offer a general method that can quickly obtain a large training dataset for constructing a social media user classifier. A social media post contains hashtags and entities. These hashtags (e.g. "#YesScot" in the Scottish Independence Referendum) and entities (e.g. job titles or party names) can reflect the community affiliation of a social media user. We argue that a large and reliable training dataset can be obtained by distinguishing the usage of these hashtags and entities. Using the obtained training dataset, a social media user community classifier can be quickly built and then used to assist in examining the different topics discussed in communities. In conclusion, we have identified four aspects for assisting social scientists to better understand the topics discussed on social media networks. We believe that the proposed tools and approaches can help to examine the exchange of topics among communities on social media networks.
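The hashtag-based labelling idea can be sketched as follows; the hashtag lists, the whitespace tokenisation and the majority rule below are illustrative assumptions, not the exact procedure proposed in the thesis:

```python
# Sketch of hashtag-based community labelling for ground-truth generation.
# The hashtag sets and the simple majority rule are hypothetical examples.
YES_TAGS = {"#yesscot", "#voteyes"}        # pro-independence markers (illustrative)
NO_TAGS = {"#bettertogether", "#voteno"}   # pro-union markers (illustrative)

def label_user(tweets):
    """Label a user by which community's hashtags dominate their tweets."""
    yes = sum(tok in YES_TAGS for t in tweets for tok in t.lower().split())
    no = sum(tok in NO_TAGS for t in tweets for tok in t.lower().split())
    if yes > no:
        return "Yes"
    if no > yes:
        return "No"
    return None  # ambiguous users are left out of the training set

print(label_user(["Voting tomorrow #YesScot", "#VoteYes for Scotland"]))  # Yes
```

Users with no clear hashtag signal receive no label, keeping the resulting training set reliable at the cost of coverage.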

    Using Word Embedding to Evaluate the Coherence of Topics from Twitter Data

    Scholars often seek to understand topics discussed on Twitter using topic modelling approaches. Several coherence metrics have been proposed for evaluating the coherence of the topics generated by these approaches, including metrics based on the pre-calculated Pointwise Mutual Information (PMI) of word pairs and on Latent Semantic Analysis (LSA) word representation vectors. As Twitter data contains abbreviations and a number of peculiarities (e.g. hashtags), it can be challenging to train effective PMI statistics or LSA word representations. Recently, Word Embedding (WE) has emerged as a particularly effective approach for capturing the similarity among words. Hence, in this paper, we propose new Word Embedding-based topic coherence metrics. To determine the usefulness of these new metrics, we compare them with the previous PMI/LSA-based metrics. We also conduct a large-scale crowdsourced user study to determine whether the new Word Embedding-based metrics better align with human preferences. Using two Twitter datasets, our results show that the WE-based metrics can capture the coherence of topics in tweets more robustly and efficiently than the PMI/LSA-based ones.
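As an illustration, an embedding-based coherence score can be computed as the average pairwise cosine similarity of a topic's top words; the toy three-dimensional vectors below stand in for embeddings trained on a Twitter background corpus:

```python
import numpy as np

# Toy stand-ins for word embeddings; real vectors would come from a model
# trained on Twitter data.
emb = {
    "vote":   np.array([0.9, 0.1, 0.0]),
    "ballot": np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def we_coherence(top_words):
    """Mean cosine similarity over all pairs of a topic's top words."""
    sims = [cosine(emb[w1], emb[w2])
            for i, w1 in enumerate(top_words)
            for w2 in top_words[i + 1:]]
    return sum(sims) / len(sims)

# A semantically related pair scores higher than an unrelated one:
print(we_coherence(["vote", "ballot"]) > we_coherence(["vote", "banana"]))  # True
```

This is one common formulation of embedding-based coherence, not necessarily the exact metric proposed in the paper.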

    On the Reproducibility and Generalisation of the Linear Transformation of Word Embeddings

    Linear transformation is a way to learn a linear relationship between two word embeddings, such that words in the two different embedding spaces can be semantically related. In this paper, we examine the reproducibility and generalisation of the linear transformation of word embeddings. Linear transformation is particularly useful when translating between word embedding models of different languages, since it can capture the semantic relationships between the two models. We first reproduce two linear transformation approaches: a recent one using orthogonal transformation and the original one using simple matrix transformation. Previous findings on a machine translation task are re-examined, validating that linear transformation is indeed an effective way to transform word embedding models across languages. In particular, we show that the orthogonal transformation can better relate the different embedding models. Following the verification of previous findings, we then study the generalisation of linear transformation in a multi-language Twitter election classification task. We observe that the orthogonal transformation outperforms the matrix transformation. In particular, it significantly outperforms a random classifier by at least 10% under the F1 metric across English and Spanish datasets. In addition, we also provide best practices for using linear transformation for multi-language Twitter election classification.
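For paired word vectors stacked as rows of X (source space) and Y (target space), the orthogonal map W minimising ||XW - Y|| has a closed-form solution via SVD, known as the orthogonal Procrustes solution. A minimal sketch with synthetic data (the random matrices below are illustrative, not the paper's setup):

```python
import numpy as np

def orthogonal_map(X, Y):
    """Orthogonal Procrustes: the W minimising ||XW - Y|| over orthogonal W."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # W satisfies W @ W.T = I

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 "source" word vectors
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a hidden orthogonal map
Y = X @ R                                # "target" vectors related by R
W = orthogonal_map(X, Y)
print(np.allclose(W, R))  # the hidden map is recovered exactly
```

The orthogonality constraint preserves distances and angles between word vectors, which is one intuition for why the orthogonal variant relates embedding spaces better than an unconstrained matrix.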

    On Refining Twitter Lists as Ground Truth Data for Multi-Community User Classification

    To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities that a given user belongs to, e.g. business or politics. Obtaining high-quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground-truth data is to extract users from existing public Twitter lists, where those lists represent different communities, e.g. a list of journalists. However, ground-truth datasets obtained using such lists can be noisy, since not all users that belong to a community are good training examples for that community. In this paper, we conduct a thorough failure analysis of a ground-truth dataset generated using Twitter lists. We discuss how some categories of users collected from these public Twitter lists could negatively affect classification performance and therefore should not be used for training. Through experiments with 3 classifiers and 5 communities, we show that removing ambiguous users based on their tweets and profiles can indeed result in a 10% increase in F1 performance.
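One simple instance of such refinement can be sketched as follows; the keyword sets and the "matches more than one community" rule are hypothetical illustrations, not the paper's actual failure categories:

```python
# Hypothetical refinement rule: drop users whose profile matches the
# keyword sets of zero or several communities, keeping unambiguous ones.
COMMUNITY_KEYWORDS = {
    "journalist": {"reporter", "journalist", "news"},
    "politician": {"mp", "minister", "candidate"},
}

def is_ambiguous(profile):
    words = set(profile.lower().split())
    matches = [c for c, kw in COMMUNITY_KEYWORDS.items() if kw & words]
    return len(matches) != 1  # keep only users matching exactly one community

users = {
    "alice": "political reporter and news junkie",
    "bob": "reporter turned candidate for parliament",
}
clean = {u: p for u, p in users.items() if not is_ambiguous(p)}
print(sorted(clean))  # ['alice']
```

Here "bob" is removed because his profile signals two communities at once, which is exactly the kind of noisy training example the failure analysis targets.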

    Topic-centric Classification of Twitter User's Political Orientation

    In the recent Scottish Independence Referendum (hereafter, IndyRef), Twitter offered a broad platform for people to express their opinions, with millions of IndyRef tweets posted over the campaign period. In this paper, we aim to classify people's voting intentions from the content of their tweets, i.e. the short messages they communicate on Twitter. By observing tweets related to the IndyRef, we find that people not only discussed the vote, but also raised topics related to an independent Scotland, including oil reserves, currency, nuclear weapons, and national debt. We show that the views communicated on these topics can inform us of individuals' voting intentions ("Yes", in favour of independence, vs. "No", opposed). In particular, we argue that an accurate classifier can be designed by leveraging the differences in feature usage across the different topics related to voting intentions. We demonstrate improvements upon a Naive Bayesian classifier using this topic-enrichment method. Our new classifier identifies the closest topic for each unseen tweet, based on the topics identified in the training data. Our experiments show that our Topics-Based Naive Bayesian classifier improves accuracy by 7.8% over the classical Naive Bayesian baseline.
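The topic-enrichment idea can be sketched as below: first find the closest topic for a tweet by word overlap, then score each class with a Laplace-smoothed Naive Bayes estimate conditioned on that topic. The topics, word counts and smoothing constant here are toy assumptions, not the paper's trained model:

```python
from collections import Counter
import math

# Toy topics and per-(topic, class) word counts; a real model would learn
# these from labelled IndyRef tweets.
topics = {
    "currency": {"pound", "currency", "euro"},
    "oil": {"oil", "reserves", "barrel"},
}
counts = {
    ("currency", "Yes"): Counter({"pound": 2, "keep": 1}),
    ("currency", "No"):  Counter({"pound": 1, "risk": 3}),
    ("oil", "Yes"):      Counter({"oil": 3, "wealth": 2}),
    ("oil", "No"):       Counter({"oil": 1, "declining": 2}),
}
VOCAB = len(set().union(*(set(c) for c in counts.values())))

def closest_topic(words):
    """Pick the training topic sharing the most words with the tweet."""
    return max(topics, key=lambda t: len(topics[t] & words))

def classify(tweet, classes=("Yes", "No"), alpha=1.0):
    words = set(tweet.lower().split())
    topic = closest_topic(words)
    def score(c):  # log-likelihood under the topic-conditioned word model
        cnt = counts[(topic, c)]
        total = sum(cnt.values())
        return sum(math.log((cnt[w] + alpha) / (total + alpha * VOCAB))
                   for w in words)
    return max(classes, key=score)

print(classify("oil wealth for Scotland"))  # Yes
```

Conditioning the word statistics on the nearest topic lets the same word carry different evidence in different topical contexts, which is the intuition behind the reported accuracy gain.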

    Analysing political events on Twitter: topic modelling and user community classification

    Recently, political events, such as elections or referenda, have raised a lot of discussion on social media networks, in particular Twitter. This brings new opportunities for social scientists to address social science tasks, such as understanding what communities said, identifying whether a community has an influence on another, or analysing how these communities respond to political events online. However, identifying these communities and extracting what they said from social media data are challenging and non-trivial tasks. In this thesis, we aim to make progress towards understanding ‘who’ (i.e. communities) said ‘what’ (i.e. discussed topics) and ‘when’ (i.e. time) during political events on Twitter. While identifying the ‘who’ can benefit from Twitter user community classification approaches, ‘what’ they said and ‘when’ can be effectively addressed on Twitter by extracting their discussed topics using topic modelling approaches that also account for the importance of time on Twitter. To evaluate the quality of these topics, it is necessary to investigate how coherent these topics are to humans. Accordingly, we propose a series of approaches in this thesis. First, we investigate how to effectively evaluate the coherence of the topics generated using a topic modelling approach. A topic coherence metric evaluates topical coherence by examining the semantic similarity among the words in a topic. We argue that the semantic similarity of words in tweets can be effectively captured by using word embeddings trained on a Twitter background dataset. Through a user study, we demonstrate that our proposed word embedding-based topic coherence metric can assess the coherence of topics as humans do. In addition, inspired by the precision at k information retrieval metric, we propose to evaluate the coherence of a topic model (containing many topics) by averaging over the top-ranked topics within the topic model.
Our proposed metrics can not only evaluate the coherence of topics and topic models, but can also help users to choose the most coherent topics. Second, we aim to extract topics with high coherence from Twitter data. Such topics can be easily interpreted by humans, and they can help to examine ‘what’ has been discussed on Twitter and ‘when’. Indeed, we argue that topics can be discussed in different time periods and therefore can be effectively identified and distinguished by considering those time periods. Hence, we propose an effective time-sensitive topic modelling approach by integrating the time dimension of tweets (i.e. ‘when’). We show that the time dimension helps to generate topics with high coherence. Hence, we argue that ‘what’ has been discussed and ‘when’ can be effectively addressed by our proposed time-sensitive topic modelling approach. Next, to identify ‘who’ participated in the topic discussions, we propose approaches to identify the community affiliations of Twitter users, including automatic ground-truth generation approaches and a user community classification approach. To generate ground-truth data for training a user community classifier, we show that the hashtags and entities mentioned in users’ tweets can indicate which community a Twitter user belongs to. Hence, we argue that they can be used to generate ground-truth data for classifying users into communities. On the other hand, we argue that different communities favour different topic discussions, and that their community affiliations can be identified by leveraging the discussed topics. Accordingly, we propose a Topic-Based Naive Bayes (TBNB) classification approach to classify Twitter users based on their words and discussed topics. We demonstrate that our TBNB classifier, together with the ground-truth generation approaches, can effectively identify the community affiliations of Twitter users.
Finally, to show the generalisation of our approaches, we apply them to analyse 3.6 million tweets related to the 2016 US Election on Twitter. We show that our TBNB approach can effectively identify the ‘who’, i.e. classify Twitter users into communities by using hashtags and the discussed topics. To investigate ‘what’ these communities discussed, we apply our time-sensitive topic modelling approach to extract coherent topics. We finally analyse the community-related topics evaluated and selected using our proposed topic coherence metrics. Overall, we contribute effective approaches to assist social scientists in analysing political events on Twitter. These approaches include topic coherence metrics, a time-sensitive topic modelling approach, and approaches for classifying the community affiliations of Twitter users. Together, they make progress towards studying and understanding the connections and dynamics among communities on Twitter.

    News and information leadership in the digital age

    This paper examines information networks on social media to draw conclusions about influence relationships among members of the mass media. The project considers social networks and information patterns using Twitter data, first at the newspaper level and second at the journalist level. Using a computational approach, we look for evidence of elite-directed information flows, as well as exploring whether we find evidence of an increase in the democratization of newsmaking. This study finds that elite voices continue to dominate information networks in the digital age; however, it also finds evidence that information can move expeditiously from journalists in local and regional outlets to elite ones, and vice versa. We move further to explore the content of tweets within the journalist network, finding that there are substantial, direct interactions among elite, regional and local journalists. Our results, taken together, uncover new network patterns and provide novel insight into the role of information technologies in newsmaking in the digital age.

    Votes on Twitter: assessing candidate preferences and topics of discussion during the 2016 U.S. presidential election

    Social media offers scholars new and innovative ways of understanding public opinion, including citizens' prospective votes in elections and referenda. We classify social media users' preferences over the two U.S. presidential candidates in the 2016 election using Twitter data, and explore the topics of conversation among pro-Clinton and pro-Trump supporters. We take advantage of hashtags that signaled users' vote preferences to train our machine learning model, which employs a novel classifier, a Topic-Based Naive Bayes model, that we demonstrate improves on existing classifiers. Our findings demonstrate that we are able to classify users with a high degree of accuracy and precision. We further explore the similarities and divergences among what pro-Clinton and pro-Trump users discussed on Twitter.

    Follow-on Question Suggestion via Voice Hints for Voice Assistants

    The adoption of voice assistants like Alexa or Siri has grown rapidly, allowing users to instantly access information via voice search. Query suggestion is a standard feature of screen-based search experiences, allowing users to explore additional topics. However, it is not trivial to implement in voice-based settings. To enable this, we tackle the novel task of suggesting questions with compact and natural voice hints that allow users to ask follow-up questions. We define the task, ground it in syntactic theory and outline linguistic desiderata for spoken hints. We propose baselines and an approach using sequence-to-sequence Transformers to generate spoken hints from a list of questions. Using a new dataset of 6681 input questions and human-written hints, we evaluate the models with automatic metrics and human evaluation. Results show that a naive approach of concatenating suggested questions creates poor voice hints. Our approach, which applies a linguistically motivated pretraining task, was strongly preferred by humans for producing the most natural hints. (Accepted as a Long Paper at EMNLP'23 Findings.)

    An Effective Approach for Modelling Time Features for Classifying Bursty Topics on Twitter

    Several previous approaches have attempted to predict bursty topics on Twitter. Such approaches have usually reported that the time information (e.g. a topic's popularity over time) of hashtag topics contributes the most to the prediction of bursty topics. In this paper, we propose a novel approach that uses time features to predict bursty topics on Twitter. We model the popularity of topics as density curves described by the density function of a beta distribution with different parameters. We then propose various approaches to predict/classify bursty topics by estimating the parameters of topics, using estimators such as Gradient Descent or Likelihood Maximization. In our experiments, we show that the estimated parameters of topics have a positive effect on classifying bursty topics. In particular, our estimators, when combined, improve bursty topic classification by 6.9 in terms of micro F1 compared to a baseline classifier using hashtag content features.
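The beta-distribution modelling of a topic's popularity can be illustrated as follows: normalise the topic's tweet timestamps into [0, 1], then estimate the beta parameters. The method-of-moments estimator below is one simple choice for the sketch; it is not one of the paper's estimators (which include Gradient Descent and Likelihood Maximization), and the timestamps are made up:

```python
def beta_moments(times):
    """Method-of-moments estimates of Beta(alpha, beta) from times in (0, 1)."""
    n = len(times)
    mean = sum(times) / n
    var = sum((t - mean) ** 2 for t in times) / n
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common  # (alpha, beta)

# A topic that bursts early in the observation window: most of its
# (normalised) tweet timestamps sit near t = 0.
early = [0.05, 0.1, 0.12, 0.2, 0.25, 0.3, 0.5]
a, b = beta_moments(early)
print(a < b)  # alpha < beta means the density peaks early in the window
```

The fitted (alpha, beta) pair summarises the shape of the popularity curve, and it is such parameter estimates that serve as features for the bursty-topic classifier.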